The paper of Vladimir Koltchinskii has been circulating for several
years and has already become an important reference in statistical learning
theory. One of the main achievements of the paper (further abbreviated
as [VK]) is to propose very general techniques for proving oracle inequalities
for the excess risk under a control of the variance, that is, for example, under
conditions (6.1) or (6.2) (often called margin or low noise conditions) or
similar assumptions in terms of $L_2$-diameters $D_P(\mathcal{F},\delta)$ and
other related characteristics. These conditions lead to fast rates for the
excess risk, that is, to rates that are faster than $n^{-1/2}$. The setup in
[VK] is classical: methods based on empirical risk minimizers (ERM) are
studied for bounded loss functions.

My comments and questions will be mainly about optimality of the excess
risk bounds. This issue is not at all obvious, even in the case where
the underlying class $\mathcal{F}$ is finite. We assume in what follows that
either $\mathcal{F} = \{f_1, \dots, f_M\}$, where the $f_j$ are some functions
on $S$, or this class is a convex hull
$\mathcal{F} = \mathrm{conv}\{f_1, \dots, f_M\}$. Such classes $\mathcal{F}$
are used in aggregation problems where the functions $f_j$ are viewed either
as "weak learners" or as some preliminary estimators constructed from a
training sample which is considered as frozen in further analysis.

Let $Z_1, \dots, Z_n$ be i.i.d. random variables taking values in a space
$\mathcal{Z}$, with common distribution $P$, and denote by $\mathcal{F}_0$ the
space where the $f_j$ live. Consider a loss function
$Q : \mathcal{Z} \times \mathcal{F}_0 \to \mathbb{R}$ and the associated risk

$$R(f) = \mathbb{E}\,Q(Z, f),$$

assuming that the expectation $\mathbb{E}\,Q(Z, f)$ is finite for all
$f \in \mathcal{F}_0$, where $Z$ has the same distribution as $Z_1$. Introduce
two oracle risks: $R_{MS} = \min_{1 \le j \le M} R(f_j)$, corresponding to
model selection-type aggregation (MS-aggregation), and
$R_C = \inf_{f \in \mathrm{conv}\{f_1, \dots, f_M\}} R(f)$, corresponding to
convex aggregation (C-aggregation). The excess risk of a statistic
$\tilde f_n(Z_1, \dots, Z_n)$ is defined by

$$\mathcal{E}(\tilde f_n) = \mathbb{E}\{R(\tilde f_n)\} - R_{OR},$$

where the oracle risk $R_{OR}$ equals either $R_{MS}$ or $R_C$. A natural
question about optimality is the following: how to find an estimator
$\tilde f_n$ for which the excess risk is as small as possible? Of course,
this question cannot be answered simultaneously for all distributions $P$.
However, optimality can be treated in a minimax sense: introduce a class
$\mathcal{P}$ of distributions and call $e_{n,M}$ the optimal rate of
aggregation if

$$\inf_{T_n} \sup_{\mathcal{G}_M} \mathcal{E}(T_n) \asymp e_{n,M},$$

where $\mathcal{G}_M = \{(P, f_1, \dots, f_M) : P \in \mathcal{P},\ f_j \in \mathcal{F}_0\}$
and the infimum is taken over all estimators $T_n$. An estimator $\tilde f_n$
is declared to be optimal if it achieves

$$\sup_{\mathcal{G}_M} \mathcal{E}(\tilde f_n) \le C e_{n,M}$$

for some constant $C$ independent of $n$ and $M$. Optimal rates of aggregation
are known for several important special cases [6]: for instance, if
$\mathcal{F}_0$ is the class of all functions bounded in absolute value by a
given constant, in the Gaussian (or bounded) regression model with squared
loss the optimal rates are

(1)
$$
e_{n,M} \asymp
\begin{cases}
M/n, & \text{for C-aggregation if } M \le \sqrt{n},\\[4pt]
\sqrt{\dfrac{1}{n}\log\left(\dfrac{M}{\sqrt{n}} + 1\right)}, & \text{for C-aggregation if } M > \sqrt{n},\\[4pt]
(\log M)/n, & \text{for MS-aggregation,}
\end{cases}
$$

and optimal procedures $\tilde f_n$ attaining these rates are available [6].
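To get a feeling for the three regimes in (1), the rates can be evaluated numerically. The following is only an illustrative sketch: the function name `optimal_rate` and the string labels `"C"` and `"MS"` are introduced here and are not part of the original discussion, and the returned value is the rate up to a constant factor, as in (1).

```python
import math


def optimal_rate(n: int, M: int, kind: str) -> float:
    """Rate e_{n,M} of display (1), up to a constant factor.

    kind = "C"  : convex aggregation,
    kind = "MS" : model selection-type aggregation.
    """
    if n < 1 or M < 1:
        raise ValueError("n and M must be positive")
    if kind == "MS":
        return math.log(M) / n
    if kind == "C":
        root_n = math.sqrt(n)
        if M <= root_n:
            # small dictionaries: parametric-type rate M/n
            return M / n
        # large dictionaries: sqrt((1/n) * log(M / sqrt(n) + 1))
        return math.sqrt(math.log(M / root_n + 1.0) / n)
    raise ValueError("kind must be 'C' or 'MS'")
```

For fixed $n$, comparing `optimal_rate(n, M, "C")` with `optimal_rate(n, M, "MS")` over a range of $M$ shows how much slower convex aggregation becomes once $M$ exceeds $\sqrt{n}$.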

The paper [VK] suggests very general bounds on probabilities of deviations
of $R(\hat f_n) - R_{OR}$, where $\hat f_n$ is an empirical risk minimizer.
Clearly, these bounds can be applied to evaluate the expected excess risk
$\mathcal{E}(\hat f_n)$ and to check whether $\hat f_n$ attains optimality (at
least for the bounded regression model). I think that this should be the case
for C-aggregation, probably under some more assumptions on $n$ and $M$, but in
general not for MS-aggregation. Furthermore, presumably no selector, that is,
no procedure that chooses only one of the $M$ functions as estimator, can
achieve the MS-rate given in (1) under strictly convex loss. On the other
hand, MS-optimality can be achieved by estimators $\tilde f_n$ that are convex
mixtures of $f_1, \dots, f_M$ with data-dependent coefficients. A simple
aggregation method of this kind, called mirror averaging [3, 4], is defined as
follows.
Let $\Lambda^M = \{\theta = (\theta^{(1)}, \dots, \theta^{(M)}) : \theta^{(j)} \ge 0,\ \sum_{j=1}^M \theta^{(j)} = 1\}$
be the unit simplex in $\mathbb{R}^M$, and let $G : \mathbb{R}^M \to \Lambda^M$
be a function satisfying certain assumptions [3]. A possible choice of $G$ is

$$
G(z) = \left(\frac{\exp(-z^{(1)})}{\sum_{j=1}^M \exp(-z^{(j)})}, \dots,
\frac{\exp(-z^{(M)})}{\sum_{j=1}^M \exp(-z^{(j)})}\right),
\qquad z = (z^{(1)}, \dots, z^{(M)}).
$$

This particular function $G$ (corresponding to the Gibbs distribution) will be
considered in what follows. To any $z \in \mathbb{R}^M$ we associate its
"mirror image" in the simplex $\Lambda^M$, that is, a probability vector
$G(z/\beta)$ where $\beta > 0$ is a tuning parameter.

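For any $\theta = (\theta^{(1)}, \dots, \theta^{(M)}) \in \Lambda^M$ set
$f_\theta = \sum_{j=1}^M \theta^{(j)} f_j$ and assume that $Q(Z, f_\theta)$ is
differentiable w.r.t. $\theta$ with gradient $\nabla_\theta Q(Z, f_\theta)$.
Given two sequences of positive numbers $\beta_i$ and $\gamma_i$, the mirror
averaging (MA) algorithm is defined as follows:

• $i = 0$: initialize values $\zeta_0 = 0 \in \mathbb{R}^M$, $\theta_0 \in \Lambda^M$,

• for $i = 1, \dots, n$, iterate:

    $\zeta_i = \zeta_{i-1} + \gamma_i \nabla_\theta Q(Z_i, f_{\theta_{i-1}})$    (gradient descent)
    $\theta_i = G(\zeta_i/\beta_i)$    (mirroring)

• output $\hat\theta_n$ and set $\tilde f_n = f_{\hat\theta_n}$    (averaging)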
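The three steps above can be sketched in a few lines of code. This is a minimal illustration, not the procedure of [3] in full detail: the names `gibbs_map`, `mirror_averaging`, `grad` and `beta0` are hypothetical, and the tuning choices $\gamma_i = 1$, $\beta_i = \beta_0\sqrt{i+1}$, the uniform starting point $\theta_0$, and the output $\hat\theta_n$ taken as the plain average of $\theta_0, \dots, \theta_{n-1}$ are assumptions made here for concreteness.

```python
import numpy as np


def gibbs_map(z):
    """Gibbs map G: R^M -> simplex, with G(z)_j proportional to exp(-z_j)."""
    z = np.asarray(z, dtype=float)
    w = np.exp(-(z - z.min()))  # shifting by min(z) avoids overflow
    return w / w.sum()


def mirror_averaging(grad, data, M, beta0=1.0):
    """Sketch of mirror averaging over the simplex of R^M.

    grad(z, theta) must return the gradient of Q(z, f_theta) w.r.t. theta,
    as an array of length M. Assumed tuning: gamma_i = 1,
    beta_i = beta0 * sqrt(i + 1), output = average of theta_0..theta_{n-1}.
    """
    zeta = np.zeros(M)            # zeta_0 = 0
    theta = np.full(M, 1.0 / M)   # theta_0 = uniform weights (assumption)
    theta_sum = np.zeros(M)
    n = 0
    for i, z in enumerate(data, start=1):
        theta_sum += theta                     # accumulate theta_{i-1}
        zeta = zeta + grad(z, theta)           # gradient step with gamma_i = 1
        theta = gibbs_map(zeta / (beta0 * np.sqrt(i + 1.0)))  # mirroring
        n = i
    if n == 0:
        raise ValueError("data must be non-empty")
    return theta_sum / n                       # averaging
```

For regression with squared loss, one possible `grad` is `lambda z, theta: 2.0 * (theta @ z[0] - z[1]) * z[0]`, where `z[0]` is the vector of values $f_1(X), \dots, f_M(X)$ and `z[1]` is the response.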



Note that the vector of weights $\hat\theta_n$ belongs to the simplex
$\Lambda^M$, so that $\tilde f_n$ is a convex mixture of the initial functions
(estimators) $f_j$ with data-dependent weights. The following theorem, proved
in [3], shows that the MA estimator satisfies a sharp oracle inequality.

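Theorem 1 (Convex aggregation). Let $Q(Z, f_\theta)$ be convex on $\Lambda^M$
for all $Z \in \mathcal{Z}$ and

(2)    $\sup_{\theta \in \Lambda^M} \mathbb{E}\,\|\nabla_\theta Q(Z, f_\theta)\|_\infty^2 \le Q_*^2$,

where $\|\cdot\|_\infty$ is the sup-norm in $\mathbb{R}^M$. Then the mirror
averaging algorithm with appropriate $\beta_i$ and $\gamma_i$ outputs
$\tilde f_n$ that satisfies

(3)    $\mathbb{E}\{R(\tilde f_n)\} - R_C \le 2\,\ldots$

for all $n \ge 1$, $M \ge 2$.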
If $Q_*$ does not depend on $M$, the rate of convergence on the right-hand
side of (3) is optimal for $M$ comfortably larger than $\sqrt{n}$ [cf. (1)].
Although this is not explicitly stated in [VK], it seems that a similar result
can be obtained for the ERM $\hat f_n$ if $Q$ is strictly convex in $\theta$
(condition (7.5) of [VK]) using the techniques of Sections 7 and 8. In
particular, the second statement of Theorem 13 covers the case of squared
loss $Q$. It would be interesting to compare these developments to (3) and to
check whether, for a general class of convex functions, the ERM or MA
estimators achieve optimality in the zone $M \le \sqrt{n}$ where the bound of
Theorem 1 is suboptimal. Note that, in contrast to [VK], Theorem 1 is not
restricted to bounded loss functions or to loss functions with bounded
gradient. Moment conditions on the components of the gradient suffice, but
lead to coarser bounds where $Q_*$ grows with $M$.

A particular instance of the MA algorithm can be used to mimic the MS-oracle
with sharp bounds on the excess risk. It is called the linearized mirror
averaging (LMA) algorithm and is defined in the same way as MA, with the
only difference that the gradient descent step is modified as follows [4]:

$$\zeta_i = \zeta_{i-1} + u_i, \quad \text{where } u_i = (Q(Z_i, f_1), \dots, Q(Z_i, f_M))^\top.$$

Thus, LMA is a special case of mirror averaging associated with the
"surrogate" linear risk $\tilde Q(Z, \theta) = \theta^\top u(Z)$, where
$u(Z) = (Q(Z, f_1), \dots, Q(Z, f_M))^\top$. Two special cases of LMA, for
regression with squared loss and for density estimation with Kullback loss,
have been studied earlier (cf. [2] and the references therein).
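Since the LMA update only accumulates the losses of the individual $f_j$, the resulting weights are averaged exponential weights. The sketch below illustrates this with constant $\beta_i = \beta$ and $\gamma_i = 1$. The function name `linearized_mirror_averaging` and the input layout (an $n \times M$ array of losses) are hypothetical, and taking the output as the plain average of $\theta_0, \dots, \theta_{n-1}$ with $\zeta_0 = 0$ is an assumption made here, not a statement of the exact averaging rule of [4].

```python
import numpy as np


def linearized_mirror_averaging(losses, beta):
    """Sketch of LMA weights with beta_i = beta, gamma_i = 1.

    losses[i - 1, j - 1] is assumed to hold Q(Z_i, f_j).
    Returns a weight vector in the simplex of R^M.
    """
    losses = np.asarray(losses, dtype=float)
    if losses.ndim != 2 or losses.shape[0] == 0:
        raise ValueError("losses must be a non-empty (n, M) array")
    if beta <= 0:
        raise ValueError("beta must be positive")
    n, M = losses.shape
    zeta = np.zeros(M)        # zeta_0 = 0 (assumption)
    theta_sum = np.zeros(M)
    for i in range(n):
        # theta_i = G(zeta_i / beta): Gibbs weights on cumulative losses
        w = np.exp(-(zeta - zeta.min()) / beta)
        theta_sum += w / w.sum()
        # linearized step: zeta_{i+1} = zeta_i + u_{i+1}
        zeta = zeta + losses[i]
    return theta_sum / n      # assumed averaging over theta_0..theta_{n-1}
```

The aggregate is then $\sum_j w_j f_j$ with `w` the returned vector; a function $f_j$ with consistently smaller loss receives weight close to one after a few observations.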

To state a general excess risk bound for LMA, introduce the random variable
$\omega$ taking values $1, \dots, M$ with the distribution $\mathbf{P}$ defined
conditionally on $(Z_1, \dots, Z_n)$ by $\mathbf{P}(\omega = j) = \hat\theta_n^{(j)}$,
where $\hat\theta_n^{(j)}$ is the $j$th component of $\hat\theta_n$. The
expectation corresponding to $\mathbf{P}$ is denoted by $\mathbf{E}$. The
following bound is proved in [4].

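Theorem 2 (MS). Let $\hat\theta_n$ be the output of the LMA algorithm with
$\beta_i = \beta > 0$, $\gamma_i = 1$, and let the loss function $Q$ be such
that

(4)    $\mathbb{E} \log\left(\mathbf{E} \exp\left(\dfrac{Q(Z, \mathbf{E}[f_\omega]) - Q(Z, f_\omega)}{\beta}\right)\right) \le 0$,

where $\mathbb{E}$ denotes the expectation w.r.t. the joint distribution of
$n + 1$ i.i.d. random variables $(Z_1, \dots, Z_n, Z)$. Then

(5)    $\mathbb{E}\{R(\tilde f_n)\} - R_{MS} \le \dfrac{\beta \log M}{n + 1}$.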



Condition (4) is satisfied for loss functions $Q$ that are "on the average"
(or approximately, up to a set in $\mathcal{Z}$ of small measure) strongly
convex in $\theta$; several sufficient conditions for (4) can be found in [4].
The simplest of them is concavity of the mapping

(6)    $\theta \mapsto \mathbb{E} \exp\left(-\dfrac{Q(Z, f_\theta) - Q(Z, f_{\theta'})}{\beta}\right)$

on the simplex $\Lambda^M$ for any fixed $\theta' \in \Lambda^M$. We will say
that a loss function is nice if it satisfies (4). In contrast to the convex
aggregation bound considered in Theorem 1, an inequality of the form (5) for
the excess risk has not been obtained for empirical risk minimizers, and I
conjecture that it is not true for them without additional strong
restrictions on $P$ such as $R_{MS} = 0$ or the margin assumption with
parameter $\kappa = 1$.

Note that on the right-hand side of (5) we have the optimal rate of
MS-aggregation, which proves that the LMA procedure is rate optimal
for nice loss functions. To see how sharp the bound (5) is, consider the
classification model with convex loss. Let $Z = (X, Y)$ where
$X \in \mathbb{R}^d$ is a random predictor and $Y \in \{-1, 1\}$ is a random
label. Assume that we have $M$ classifiers $f_j : \mathbb{R}^d \to [-1, 1]$,
$j = 1, \dots, M$. Consider the loss function $Q(Z, f) = \varphi(-Y f(X))$
where $\varphi : \mathbb{R} \to \mathbb{R}_+$ is a convex twice differentiable
function. The associated risk is the $\varphi$-risk of classification:

(7)    $R(f) = \mathbb{E}\,\varphi(-Y f(X))$.

Then the mapping (6) is concave if $(\varphi'(x))^2 \le \beta \varphi''(x)$,
$\forall\, |x| \le 1$. This implies, for example, that inequality (5) holds for
$R$ of the form (7) with rather sharp constants: $\beta = e$ if
$\varphi(x) = e^x$ (exponential boosting) and $\beta = e \log 2$ if
$\varphi(x) = \log_2(1 + e^x)$ (logit boosting). It would be interesting to
study whether these constants can be improved by any estimation method.
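The concavity condition can be traced back to a one-line computation, sketched here under the reading of (6) as the map $\theta \mapsto \mathbb{E}\exp(-(Q(Z, f_\theta) - Q(Z, f_{\theta'}))/\beta)$. For fixed $Z$, the argument $t = -Y f_\theta(X)$ is linear in $\theta$ and lies in $[-1, 1]$, while the factor $\exp(Q(Z, f_{\theta'})/\beta)$ is positive and does not depend on $\theta$. It therefore suffices that the auxiliary function $g$ below (a notation introduced only for this sketch) be concave on $[-1, 1]$, since concavity is preserved by composition with a linear map and by taking expectations. The last line works out the exponential case.

```latex
\[
g(t) = \exp\!\left(-\frac{\varphi(t)}{\beta}\right), \qquad
g''(t) = \left(\frac{(\varphi'(t))^2}{\beta^2} - \frac{\varphi''(t)}{\beta}\right) g(t),
\]
\[
g''(t) \le 0 \iff (\varphi'(t))^2 \le \beta\,\varphi''(t).
\]
\[
\varphi(x) = e^x:\qquad e^{2x} \le \beta e^{x} \ \text{ for all } |x| \le 1
\iff \beta \ge \sup_{|x| \le 1} e^{x} = e .
\]
```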

Finally, let me mention some other open problems related to optimality 
of excess risk bounds. 

(I) Theorem 1 holds for convex loss and Theorem 2 for nice (essentially, 
strongly convex) loss. What are optimal excess risk bounds for other 
loss functions? 

(II) What are optimal excess risk bounds over restricted classes of
underlying distributions, for example, under control of the variance
(such as the low noise, or margin, assumption)?

(III) Theorems 1 and 2 deal with two simple types of classes: finite classes
and their convex hulls. How to treat general classes $\mathcal{F}$? What are
optimal rates of aggregation [analog of (1)] for general $\mathcal{F}$?

Theorem 12 in [VK] gives an insight into (III). It considers the class
$\mathcal{F}$ which is a convex hull of a $V$-dimensional set. Instead of the
number of functions $M$ (in our case), the key parameter in Theorem 12 is the
metric dimension or the VC-dimension $V$. Theorem 12 gives only an upper
bound. How optimal is it?
Some first results for the problems (I) and (II) have been recently obtained
by Lecué [5]. He considers the classification setup as stated above, with
$\varphi$ being either the hinge loss or the indicator loss, for the class of
distributions $P$ satisfying the margin assumption (cf. (6.2) in [VK]) with
exponent $\kappa \ge 1$, and he suggests aggregate classifiers $\tilde f_n$
such that



where $R$ is either the hinge risk or the probability of misclassification,
$R_{MS}$ is the corresponding MS-oracle risk, $R^*$ is the risk of the Bayes
classifier and $C > 0$ is a constant. Furthermore, [5] proves a minimax lower
bound showing that the expression in square brackets in (8) plays the role of
the optimal rate, analogous to $e_{n,M}$. It is interesting that, in contrast
to the bounds of Theorems 1 and 2, here the optimal rate depends not only on
$n$ and $M$, but also on the difference between the oracle risk and the risk
of the Bayes classifier. Note also that, for the hinge risk, C-aggregation is
identical to MS-aggregation since for classifiers taking values in $[-1, 1]$
we have $R_{MS} = R_C$. The optimal rate of MS-aggregation cannot be as fast
as for nice loss functions [cf. (1)], except for the most favorable case where
$\kappa = 1$. These remarks show that what we should expect to get in (I) and
(II) is quite different from the previously obtained results.

My last question falls somewhat apart from the above discussion. Consider
again the classification problem under the margin condition with exponent
$\kappa \ge 1$, and assume that the regression function $\eta$ belongs to a
class of functions with the $L_\infty$ log-covering number of the order
$\varepsilon^{-\rho}$, $\rho > 0$ (such as a Hölder or Sobolev class). The
last assumption is natural when plug-in classifiers, in particular, the SVM
or boosting-type ones, are studied. The optimal rate of convergence of the
excess Bayes risk under these assumptions is a (potentially fast) rate of the
order $\psi_n = n^{-\kappa/(2\kappa - 1 + \rho(\kappa - 1))}$. The ERM
classifiers attaining this rate suggested in [1] are based on
$L_\infty$-covering of the set of regression functions $\eta$, while the
argument in [VK] uses $L_2$-covering of the set of indicators
$f(\cdot) = I\{\eta(\cdot) > 1/2\}$, which apparently leads to slower rates.
Can this argument be extended to prove that the ERM attains the optimal rate
$\psi_n$?